Add support to store and fetch dbt ls cache in remote stores #1147
Codecov Report

@@           Coverage Diff           @@
##             main    #1147   +/-  ##
=======================================
  Coverage   96.36%   96.37%
=======================================
  Files          64       64
  Lines        3443     3480    +37
=======================================
+ Hits         3318     3354    +36
- Misses        125      126     +1
Great work, @pankajkoti! I left some feedback; we can discuss and iterate on it tomorrow in a call to speed up getting this merged.
Hi @pankajkoti, summarising what we agreed:

- Rename the configuration, so we minimise the amount of configuration and stay consistent with how we set up variables when hosting docs:
  - drop ``remote_cache_path`` and reuse the pre-existing ``cache_dir``
  - rename ``remote_cache_conn_id`` to ``cache_conn_id``
- Avoid exposing the method in ``ProjectConfig``. If we could keep the cache aux functions in either ``cache.py`` or ``graph.py``, it would be ideal; we should refactor this in the future, as we discussed. If for some reason this refactor leads to the error "this class is not fork safe", it's okay to expose it in ``ProjectConfig`` as a property prefixed with ``_``. If we manage to get this to work in ``cache.py`` or ``graph.py``, we can consider decorating it with ``lru_cache`` to avoid re-computing it.
- Make sure ``delete_unused_dbt_ls_cache`` works for the remote cache.
- Do not add a dedicated CI job for this feature; instead, try to leverage something like the following in the relevant tests, setting the necessary environment variables: ``@patch.dict(os.environ, {"AIRFLOW__COSMOS__ENABLE_CACHE": "False"}, clear=True)``. We could also consider running one of the integration DAGs with this enabled, without the need to create a new CI job.
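The env-var override suggested above can be exercised in a self-contained, pytest-style check. A minimal sketch follows; the test name and body are illustrative, not Cosmos' actual tests:

```python
import os
from unittest.mock import patch

# Airflow maps AIRFLOW__<SECTION>__<KEY> environment variables onto config
# options, so this override resolves [cosmos] enable_cache to False for the
# duration of the test. clear=True removes all other environment variables.
@patch.dict(os.environ, {"AIRFLOW__COSMOS__ENABLE_CACHE": "False"}, clear=True)
def test_cache_disabled_via_env():
    assert os.environ["AIRFLOW__COSMOS__ENABLE_CACHE"] == "False"
```

Because ``patch.dict`` restores the environment when the decorated function returns, no teardown is needed.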
It's looking great, and it will be very exciting to have this feature out.
Thanks @tatiana for the review. I have managed to address the review comments as per our recent discussion.

- At this time, we couldn't use the existing
- Yes, as per the pairing session, we moved it to
- I have created a new function
- Yes, I have removed the separate job for now. As discussed, I will follow up in a subsequent PR to address this point.

Requesting a re-review, please. cc: @pankajastro
New Features

* Add support for loading manifest from cloud stores using Airflow Object Storage by @pankajkoti in #1109
* Cache ``package-lock.yml`` file by @pankajastro in #1086
* Support persisting the ``LoadMode.VIRTUALENV`` directory by @tatiana in #1079
* Add support to store and fetch ``dbt ls`` cache in remote stores by @pankajkoti in #1147
* Add default source nodes rendering by @arojasb3 in #1107
* Add Teradata ``ProfileMapping`` by @sc250072 in #1077

Enhancements

* Add ``DatabricksOauthProfileMapping`` profile by @CorsettiS in #1091
* Use ``dbt ls`` as the default parser when ``profile_config`` is provided by @pankajastro in #1101
* Add task owner to dbt operators by @wornjs in #1082
* Extend Cosmos custom selector to support + when using paths and tags by @mvictoria in #1150
* Simplify logging by @dwreeves in #1108

Bug fixes

* Fix Teradata ``ProfileMapping`` target invalid issue by @sc250072 in #1088
* Fix empty tag in case of custom parser by @pankajastro in #1100
* Fix ``dbt deps`` of ``LoadMode.DBT_LS`` should use ``ProjectConfig.dbt_vars`` by @tatiana in #1114
* Fix import handling by lazy loading hooks introduced in PR #1109 by @dwreeves in #1132
* Fix Airflow 2.10 regression and add Airflow 2.10 in test matrix by @pankajastro in #1162

Docs

* Fix typo in azure-container-instance docs by @pankajastro in #1106
* Use Airflow trademark as it has been registered by @pankajastro in #1105

Others

* Run some example DAGs in Kubernetes execution mode in CI by @pankajastro in #1127
* Install requirements.txt by default during dev env spin up by @CorsettiS in #1099
* Remove ``DbtGraph.current_version`` dead code by @tatiana in #1111
* Disable test for Airflow-2.5 and Python-3.11 combination in CI by @pankajastro in #1124
* Pre-commit hook updates in #1074, #1113, #1125, #1144, #1154, #1167

Co-authored-by: Pankaj Koti <[email protected]>
Co-authored-by: Pankaj Singh <[email protected]>
Summary
This PR introduces the functionality to store and retrieve the ``dbt ls`` output cache in remote storage systems. This enhancement improves the efficiency and scalability of cache management for Cosmos dbt projects that use the ``dbt ls`` cache option (enabled by default) introduced in PR #1014.

Key Changes

``dbt ls`` Cache Storage in Remote Stores:

* Added support to store the ``dbt ls`` cache as a JSON file in remote storage paths configured in the Airflow settings under the ``cosmos`` section.
* The cache is saved in the specified remote storage path and includes the ``cosmos_cache__`` prefix.
* Implemented logic to check the existence of the cache in the remote storage path before falling back to the Variable cache.
* If the ``remote_cache_dir`` is specified and it exists in the remote store, it is read and used; we try creating the specified path if it does not exist.
* Maintained backward compatibility by allowing users to continue using local cache storage through Airflow Variables if a ``remote_cache_dir`` is not specified.
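The remote-first lookup order described above can be sketched as follows. This is an illustrative stand-in, not Cosmos' actual implementation: ``pathlib.Path`` plays the role of Airflow's object storage path, and the function and parameter names are hypothetical.

```python
import json
from pathlib import Path

def fetch_dbt_ls_cache(remote_cache_dir, cache_identifier, variable_cache):
    """Return the cached ``dbt ls`` output, preferring the remote store.

    ``variable_cache`` stands in for the Airflow Variable-backed cache that
    remains the fallback when no remote cache dir is configured or found.
    """
    if remote_cache_dir is not None:
        # Cache files carry the cosmos_cache__ prefix, as described above.
        cache_file = Path(remote_cache_dir) / f"cosmos_cache__{cache_identifier}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
    # Fall back to the local (Airflow Variable) cache.
    return variable_cache.get(cache_identifier)
```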
Configuration

To leverage this feature, users need to set ``remote_cache_dir`` in the ``cosmos`` section of their Airflow settings. This path should point to a compatible remote storage location. You can also specify ``remote_cache_dir_conn_id``, the Airflow connection used to reach your remote store. If it is not specified, Cosmos identifies the scheme of the given path and uses the default Airflow connection ID for that scheme.

Testing

* Verified that existing behaviour is unchanged when ``remote_cache_dir`` is not configured.

Documentation

Updated the documentation to include instructions on configuring ``remote_cache_dir``.

Limitations

* If users attempt to use ``remote_cache_dir`` on an older Airflow version, they will encounter an error indicating the version requirement.

closes: #1072
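For reference, the configuration described above could look like the following hypothetical ``airflow.cfg`` fragment; the bucket path and connection ID are placeholders:

```ini
[cosmos]
remote_cache_dir = s3://my-bucket/cosmos_dbt_ls_cache/
remote_cache_dir_conn_id = aws_default
```

The same options can be supplied through Airflow's standard environment-variable mapping, e.g. ``AIRFLOW__COSMOS__REMOTE_CACHE_DIR``.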
Breaking Change?
No